Abstract


This paper presents Etna, an algorithm for atomic reads 
and writes of replicated data stored in a distributed hash 
table. Etna correctly handles dynamically changing sets 
of replica hosts, and is optimized for reads, writes, and 
reconfiguration, in that order. 


Etna maintains a series of replica configurations as 
nodes in the system change, using new sets of repli- 
cas from the pool supplied by the distributed hash table 
system. It uses the Paxos protocol to ensure consensus 
on the members of each new configuration. For simplic- 
ity and performance, Etna serializes all reads and writes 
through a primary during the lifetime of each configu- 
ration. As a result, Etna completes read and write oper- 
ations in only a single round from the primary. 


Experiments in an environment with high network de- 
lays show that Etna’s read latency is determined by 
round-trip delay in the underlying network, while write 
and reconfiguration latency is determined by the trans- 
mission time required to send data to each replica. 
Etna’s write latency is about the same as that of a 
non-atomic replicating DHT, and Etna’s read latency is 
about twice that of a non-atomic DHT due to Etna as- 
sembling a quorum for every read. 


1 Introduction 


Distributed hash tables (DHTs) provide a scalable
way to store and retrieve data among a large and
dynamic set of participating host nodes. Most existing 
DHTs provide good support for immutable data. How- 
ever, DHTs that provide fault-tolerant mutable data typ- 
ically provide no consistency guarantees. There are an 
increasing number of applications built on top of DHTs 
that require stronger consistency. These include sys- 
tems for messaging [7], sharing read/write files [16], re- 
solving names of web objects [26], maintaining bulletin 
boards [21], and searching large text databases [24]. All 


of the cited examples depend on a DHT to store and 
replicate their data, and all of them either assume mu- 
table DHT storage or could be simplified if consistent 
mutable data were supported. 


Existing work on reconfigurable atomic memory ser- 
vice, RAMBO [8], is suitable for a dynamic set of 
participants, and could be used to provide consistent 
mutable data in a DHT. However, RAMBO allows mul- 
tiple active configurations of replicas at any time. As a 
result, reads and writes in RAMBO can be costly, since 
RAMBO has to assemble a quorum in every active con- 
figuration. 


Foreseeing the need for efficient and consistent mutable
DHT data, we present Etna, a new algorithm for
atomic read/write replicated DHT objects. Etna guarantees
atomicity regardless of network behavior; for example,
it will not return stale data during network partitions.
It maintains one consistent configuration per object.
Hence, different objects are replicated on different
sets of nodes, which makes Etna scalable. Etna uses a
succession of configurations to ensure that only the sin- 
gle most up-to-date quorum of replicas can execute op- 
erations. Etna handles configuration changes using the 
Paxos distributed consensus algorithm [11]. Etna is de- 
signed for low message complexity in the common case 
in which reads and writes are more frequent than recon- 
figurations: both reads and writes involve only a single 
round of communication. 


We have implemented Etna on top of the Chord DHT. 
Experiments in an environment with high network de- 
lays show that Etna’s read latency is determined by 
round-trip delay in the underlying network, while write 
and reconfiguration latency is determined by the trans- 
mission time required to send data to each replica. 
Etna’s write latency is about the same as that of a 
non-atomic replicating DHT, and Etna’s read latency is 
about twice that of a non-atomic DHT due to Etna as- 
sembling a quorum for every read. 


This paper contains two primary contributions. First,
we introduce the first complete design and implementation
of an atomic update algorithm in a complete
DHT.¹ Second, we provide experimental results
demonstrating the performance of the working implementation.


The rest of this paper includes an overview of related
work in Section 2, a description of our system model in
Section 3, a summary of existing ideas and their interactions
with Etna in Section 4, a description of the Etna
algorithm in Section 5, a proof of atomicity in Section 6,
a performance analysis in Section 7, and a preliminary
evaluation of an implementation in Section ??.


2 Related Work 


A few DHT proposals address the issue of atomic data 
consistency in the face of dynamic membership. Ro- 
drigues et al. use a small configuration service 
to maintain and distribute a list of all the non-failed 
nodes. Since every participant is aware of the complete 
list of active nodes, it is easy to transfer and replicate 
data while ensuring consistency. However, maintaining 
global knowledge may limit the approach to small or 
relatively static systems. Etna uses the Chord DHT 
to manage the dynamic environment, and augments the 
basic service to guarantee robust, mutable data. 


There have been many quorum-based atomic read/write 
algorithms developed for static sets of replica hosts (for 
example [25, 2]). These algorithms assume that the participants
are known in advance, and that the number of
failures is bounded by a constant. 


Group communication services [9] and other virtually 
synchronous services [3] support the construction of ro- 
bust and dynamic systems. These algorithms provide 
stronger guarantees than are required for mutable data 
storage, implementing totally-ordered broadcast, which 
effectively requires consensus to be performed for ev- 
ery operation. As a result, the GCS algorithms work 
best in a low-latency LAN environment. Also, in most 
GCS systems, whenever a node joins or leaves, a new 
“view” (i.e., configuration) is created, leading to a po- 
tentially slow reconfiguration. Etna uses some of the 
reconfiguration techniques developed in the GCS protocols.
However, the read and write operations in Etna
require less communication than the multi-phase protocol
required to perform totally-ordered broadcast. Also,
the rate of reconfiguration can be significantly reduced: 
a new configuration need only be created when a number
of replicas has failed (and no reconfiguration is necessary
as a result of join operations).


¹Source available at http://pdos.lcs.mit.edu/chord


Prior approaches for reconfigurable read/write mem- 
ory require that new quorums include pro- 
cessors from the old quorums, restricting the choice of 
a new configuration. Some earlier algorithms [6]
rely on a single process to initiate all reconfigurations. 
The RAMBO algorithms [8], on the other hand, al- 
low completely flexible reconfiguration, and Etna takes 
a similar approach. However, RAMBO focuses on al- 
lowing reads and writes to proceed concurrently with 
reconfiguration, resulting in multiple active configura- 
tions. Etna, instead, optimizes read and write perfor- 
mance assuming that reconfigurations are rare, by only 
allowing one active configuration at any time. Hence, 
Etna needs to contact only a quorum in the active con- 
figuration during reads and writes, while RAMBO may 
have to assemble quorums from multiple active config- 
urations. As a result, read and write operations in Etna 
are much more efficient than those in RAMBO. 


Recent work has applied quorum-based techniques to
dynamic systems. Abraham and Malkhi [1] apply probabilistic
quorum techniques to a dynamic de Bruijn network.
Naor and Wieder suggest a way to apply
a quorum system to a dynamic two-dimensional DHT. 
Etna could make use of either of these techniques to 
choose consistent quorums. However, since our pri- 
mary goal is to provide a complete design and imple- 
mentation of atomic memory on a DHT, we choose 
to apply the quorum technique to Chord, which is a 
widely-deployed DHT that has a dynamic ring topology.


3 System Model 


We assume a dynamic, cooperating set of nodes in 
a partially synchronous environment. Communication 
links may be arbitrarily slow. However, when making 
progress guarantees and theoretical performance anal- 
ysis, we assume that messages are delivered within a 
bounded time, d. Nodes have access to local clocks 
(which they use for timeouts), but the clocks are not 
necessarily synchronized. Nodes can crash (fail-stop), 
join or leave the system at any time. 


4 Background 


This section describes two components that Etna uses
to implement atomic memory; see Figure 1 for an illustration.


[Figure: a client calls write(key, new_object) on Etna; Etna asks
Chord/DHash for succ(key) and successors(key), which return node
IDs (nodeID1, nodeID2, ..., nodeIDk).]

Figure 1: Interaction between Etna and other components.


4.1 Chord 


Chord is an efficient, load-balanced DHT. It performs
a lookup of an object, given a key, in O(log(n))
time, where n is the total number of nodes in the system.
Chord arranges that each node knows its current
successor, the node with the next highest ID; the ID 
space wraps around at zero. Etna uses Chord to provide 
it with the node whose ID immediately follows an ob- 
ject key. Etna also uses Chord to provide it with a set of 
nodes whose IDs are the closest successors of an object 
key. Etna does not rely on Chord to consistently iden- 
tify the successor node(s) for an object, since the set of 
nodes in Chord can rapidly change. Etna maintains a 
consistent series of replicas for an object regardless of 
Chord’s inconsistency. 
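As a rough illustration of the two lookups Etna requests from Chord, the following sketch computes a key's successor and its k closest successors on a sorted ring of node IDs. It is not Chord's distributed lookup: the in-memory `ring` list, the sample IDs, and the convention that a node whose ID equals the key counts as its successor are all assumptions made for this example.

```python
from bisect import bisect_left

def succ(ring, key):
    """Return the first node ID at or after key, wrapping around at zero."""
    idx = bisect_left(ring, key)
    return ring[idx % len(ring)]

def k_successors(ring, key, k):
    """Return up to k node IDs that follow key clockwise on the ring."""
    idx = bisect_left(ring, key)
    n = len(ring)
    return [ring[(idx + step) % n] for step in range(min(k, n))]

ring = sorted([5, 20, 37, 50, 61])   # hypothetical node IDs
print(succ(ring, 21))                # 37
print(succ(ring, 62))                # wraps around to 5
print(k_successors(ring, 55, 3))     # [61, 5, 20]
```

In a live system two nodes may compute different answers for the same key while membership is changing, which is why Etna does not depend on these results being consistent.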


4.2 Paxos 


Etna uses the Paxos distributed consensus pro- 
tocol to determine a total-ordering of the replica con- 
figurations. We execute a single instance of the Paxos 
protocol for each configuration; Paxos then outputs 
the next configuration. All Paxos procedures are sub- 
scripted by c, the configuration that is associated with 
that instance of Paxos. 


Paxos is a three-phase commit protocol that implements 
consensus. As originally described, Paxos consists of 
two components: an eventual leader election service, 
and an agreement protocol. The eventual leader election 
service ensures that eventually only one node thinks that 
it is the leader (if the communication network is eventually
well-behaved). The agreement protocol allows a
node that thinks it is the leader to propose a value. It 


guarantees that at most one value is chosen, and
that if eventually there is only one leader, then some 
value is chosen. In the first phase of the protocol, the 
leader queries for previously proposed decision values; 
the second phase chooses a decision value. 


Etna only makes use of the agreement protocol. When- 
ever Etna notices that the Chord successor for an object 
is not the same as the Etna primary, Etna uses Paxos to 
get a consensus on a new configuration in which they 
are the same. One or more Etna nodes create proposed 
configurations and pass them to Paxos: 


    Paxos.propose(proposal_value).


When consensus is reached, Paxos calls the Etna
decide(decision_value) procedure. At this point, Etna
notifies the new configuration of the decision, and the
configuration’s primary
proceeds to locate the object’s latest version and serve 
client requests. Paxos guarantees the following: 


Theorem 1 (derived from [12]). For each instance,
c, of the Paxos protocol, if one or more
decide(decision_value)_c events occur, then all the values
of decision_value are the same.


If, eventually, a Paxos.propose_c occurs at some node
i at time t, and no later Paxos.propose_c occurs at
any other node j, and node i does not fail, then a
Paxos.decide(...)_c occurs by time t + 2d, where d is the
time it takes for a message to be delivered.


Since Chord nodes form a dynamic network, their view
of the current successor of an object can become inconsistent.
It is possible that two nodes think that they are
an object’s current successor and simultaneously initiate
the agreement protocol. This scenario does not violate 
Etna’s correctness, since Paxos guarantees that, given 
a configuration of replicas, it will always decide on at 
most one value. 
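To make the "at most one value" guarantee concrete, here is a minimal, synchronous sketch of the textbook single-decree agreement protocol. It is not Etna's Paxos implementation: the `Acceptor` class, the integer ballots, and the in-process calls that stand in for messages are assumptions of this example. The first phase queries for previously accepted values and the second phase asks the acceptors to accept one.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot this acceptor has promised
        self.accepted = None    # (ballot, value) last accepted, or None

    def prepare(self, ballot):
        """Phase 1: promise to ignore lower ballots; report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """Return the chosen value, or None if this proposer gets no majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < majority:
        return None
    prior = [acc for acc in promises if acc is not None]
    if prior:
        # Adopt the value accepted under the highest earlier ballot.
        value = max(prior, key=lambda bv: bv[0])[1]
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, 1, "config-A")    # chooses "config-A"
second = propose(acceptors, 2, "config-B")   # must adopt "config-A"
stale = propose(acceptors, 1, "config-C")    # old ballot, rejected
```

The second proposer stands for a node that also believes it is the object's successor: it learns the earlier choice in phase one and re-proposes it instead of its own configuration.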


5 The Etna Algorithm 


In this section we present Etna, an algorithm that pro- 
vides fault-tolerant, atomic mutable data in a DHT. For 
each mutable object, Etna uses Paxos to maintain a con- 
sistent series of replica configurations in the face of dy- 
namic membership. Given a configuration of replicas, 
Etna designates a node as the primary and serializes all 
reads and writes through it. 


Per Node State
  status      Flag variable, with values idle, active,
              or recon_inprog, initially idle
  tag         A pair (version, primaryID), initially (0, 0)
  value       Latest object value, initially v_0
  new-tag     A pair (version, primaryID), initially (0, 0),
              containing the largest tag of any ongoing
              write operation
  config      Current configuration, with sub-fields
              seqnum ∈ ℕ, initially 0, and nodes =
              (node1, node2, ...), initially ∅

Per Operation State
  responses   Set of responses for the current operation,
              initially ∅

Figure 2: Fields in an Etna object


5.1 Object State 


Etna provides atomicity for each mutable object, which 
extends to the entire DHT. By the composability of 
atomic objects [14], all mutable objects in the DHT 
form an atomic memory. Therefore, we describe our 
protocol in terms of a single object. 


To provide fault-tolerance, Etna replicates a mutable 
object at k different nodes, where k is the system-wide
replication factor. We call this replica set a configura- 
tion of nodes responsible for an object. Etna initiates re- 
configurations in order to arrange that an object’s repli- 
cas are the nodes that immediately succeed the object’s 
key in the Chord ID space, and the object’s immedi- 
ate successor node is the primary in the configuration. 
For each object, a replica keeps a tag, which is a pair 
(version, primaryID); a status flag, which designates 
the phase of operation to the object; a new-tag, which 
is used during write operations; and a config variable, 
which contains the configuration sequence number and 
the IDs for the nodes in the object’s configuration. Etna 
increments the sequence number for each new configuration.
Figure 2 summarizes the fields in a mutable
object. We refer to a mutable object as simply an object 
throughout the rest of the paper. 
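The per-object state described above could be represented as follows. This is only a sketch: the class names `Config` and `EtnaObject`, the helper `next_tag`, and the use of Python tuples for tags are assumptions of this example. Tuples compare lexicographically, which gives the intended order on (version, primaryID) pairs, with the primary ID breaking ties.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    seqnum: int = 0        # incremented for each new configuration
    nodes: tuple = ()      # IDs of the replicas in this configuration

@dataclass
class EtnaObject:
    status: str = "idle"       # "idle", "active", or "recon_inprog"
    tag: tuple = (0, 0)        # (version, primaryID) of the stored value
    value: object = None       # latest object value
    new_tag: tuple = (0, 0)    # largest tag of any ongoing write
    config: Config = field(default_factory=Config)

def next_tag(obj, primary_id):
    """Tag a primary would give its next write: bump the version of new_tag."""
    return (obj.new_tag[0] + 1, primary_id)

obj = EtnaObject()
t = next_tag(obj, 42)    # (1, 42)
```

Because the version is compared first, a tag from a later write always exceeds an earlier one regardless of which node was primary at the time.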


5.2 Inserting a New Object 


To insert a new object, the writer passes the object’s data 
to the Etna client on the local machine. Etna extends the 
object with the fields in Figure 2. Because the object is
new, Etna can directly insert it at the initial replicas, set
to be the k immediate successors to the object’s ID.


5.3 Read Protocol


To read an object with key bID, the reader sends a read
RPC to the node i which is the immediate successor of
bID. Node i checks if it believes it is the current primary of
bID and if bID is not going through a reconfiguration
(status = active). If both conditions are true, it sends
a GET RPC to each node in config. If not, the replicas
may be going through membership reconfiguration, so
i returns an error. The reader will retry periodically.


Network functions
  send(...)_{i,j}    Sends a message from i to j

Chord functions
  succ()             Returns the immediate successor
                     of the object’s identifier on the
                     Chord ring.
  k_successors()     Returns the k immediate successors
                     of the object’s identifier on
                     the Chord ring.

Etna function
  primary(c)         Returns the primary for a given
                     configuration c

Paxos functions
  propose(...)       Proposes a new configuration.

Figure 3: Auxiliary functions, provided by Chord, Etna,
Paxos, and the network.


When a node j receives a GET RPC for bID, it looks in
config to see if the sender is the current primary of bID
and if status = active. If both conditions are true, j
returns with a positive ack. If not, it returns an error.


If i collects more than k/2 positive acks, it returns its
own stored copy of the object value to the reader. If i
fails to assemble a majority after a certain time, it returns
an error. Figure 4 shows the pseudocode for the
read protocol.
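The read path just described can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not the authors' implementation: the `Replica` class, the `alive` flag that stands in for a timeout, the convention that the first entry of `nodes` is the primary, and the direct method calls that replace RPCs are all hypothetical.

```python
class Replica:
    def __init__(self, node_id, nodes, seqnum=1):
        self.node_id = node_id
        self.nodes = list(nodes)   # config.nodes; nodes[0] is the primary
        self.seqnum = seqnum       # config.seqnum
        self.status = "active"
        self.tag = (0, 0)
        self.value = None
        self.alive = True          # False models a crashed or unreachable node

    def on_get(self, sender, seqnum):
        """Positive ack only if active, sender is primary, and seqnum matches."""
        return (self.alive and self.status == "active"
                and sender == self.nodes[0] and seqnum == self.seqnum)

def read(primary, replicas):
    """Primary-side read: return (tag, value) after a majority of acks."""
    if primary.node_id != primary.nodes[0] or primary.status != "active":
        raise RuntimeError("not primary or reconfiguring; retry later")
    acks = sum(r.on_get(primary.node_id, primary.seqnum) for r in replicas)
    if acks > len(primary.nodes) / 2:
        return primary.tag, primary.value
    raise RuntimeError("no majority; retry later")

nodes = [10, 20, 30]
replicas = [Replica(n, nodes) for n in nodes]
replicas[0].tag, replicas[0].value = (1, 10), "v1"
result = read(replicas[0], replicas)    # all three replicas ack
replicas[1].alive = False
degraded = read(replicas[0], replicas)  # two of three acks still suffice
```

Note that the primary answers from its own stored copy; the quorum round only confirms that it is still the primary of the active configuration.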


5.4 Write Protocol 


To write object bID, the writer sends a write RPC to
the successor of bID, node i. Node i consults its local state
to verify that it believes it is the current primary of bID
and that status is active. If both conditions are true,
i starts the write protocol:


1. Node i assigns a new version number to this write,
giving it tag (new-tag.version + 1, i).

2. Node i sends a put RPC to each replica node in
config.nodes, including the write’s tag and value.

3. When a node j receives a put RPC for bID, it
ignores the RPC if status is not active or if the
sender isn’t the primary. If the write’s tag is higher
than the stored tag, j updates its stored object. This
ensures that a replica is not confused if concurrent
writes arrive from the primary out of order. Node
j then returns an ack to the sender.

4. When node i receives positive acks from a majority
of the replicas, it updates its own copy of the
object, though only if it has completed no subsequent
concurrent write. Node i then replies to the
client. If i fails to assemble a majority after a certain
time, it returns an error.


Read protocol for the primary:

Procedure recv(read)_{c,i}
  if i = primary(config) then
    if status = active then
      responses ← ∅
      ∀j ∈ config.nodes do
        send(get, c, config.seqnum)_{i,j}

Procedure recv(get-ack, c, seqnum)_{j,i}
  if seqnum = config.seqnum then
    responses ← responses ∪ {j}
    if |responses| > ⌊k/2⌋ then
      if status = active then
        send(tag, value)_{i,c}

Read protocol for the replicas (including the primary):

Procedure recv(get, c, seqnum)_{j,i}
  if status = active then
    if (seqnum = config.seqnum) then
      send(get-ack, c, config.seqnum)_{i,j}

Figure 4: Pseudo-code for the read protocol.


Write protocol for the primary:

Procedure recv(write, c, new-val)_{c,i}
  if i = primary(config) then
    if status = active then
      new-tag ← (new-tag.version + 1, i)
      op-object ← (new-tag, new-val)
      responses ← ∅
      ∀j ∈ config.nodes do
        send(put, c, config.seqnum, op-object)_{i,j}

Procedure recv(put-ack, c, op-object, seqnum)_{j,i}
  if seqnum = config.seqnum then
    responses ← responses ∪ {j}
    if |responses| > ⌊k/2⌋ then
      if status = active then
        if op-object.tag > tag then
          tag ← op-object.tag
          value ← op-object.value
        send(put-ack)_{i,c}

Write protocol for the replicas (including the primary):

Procedure recv(put, c, seqnum, op-object)_{j,i}
  if status = active and i ≠ j then
    if (seqnum = config.seqnum) then
      if op-object.tag > tag then
        tag ← op-object.tag
        value ← op-object.value
      send(put-ack, c, op-object, config.seqnum)_{i,j}

Figure 5: Pseudo-code for the write protocol.


Figure 5 illustrates the write protocol. The primary assigns
increasing version numbers to writes in the order
that they arrive at the primary, but issues the writes to 
the replicas in parallel. 
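The four write steps can be sketched in the same in-memory style. This is a simplified illustration, not the authors' code: the `Replica` class, the `alive` flag, the synchronous calls that stand in for parallel RPCs, and in particular the choice to count the primary itself as one vote toward the majority are assumptions of this example.

```python
class Replica:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.nodes = list(nodes)   # config.nodes; nodes[0] is the primary
        self.status = "active"
        self.tag = (0, 0)
        self.new_tag = (0, 0)
        self.value = None
        self.alive = True          # False models a crashed or unreachable node

    def on_put(self, sender, tag, value):
        """Step 3: ignore unless active and sent by the primary, then ack."""
        if not self.alive or self.status != "active" or sender != self.nodes[0]:
            return False
        if tag > self.tag:         # an older write arriving late is not applied
            self.tag, self.value = tag, value
        return True

def write(primary, replicas, new_value):
    """Primary-side write: returns the tag once a majority has acknowledged."""
    if primary.node_id != primary.nodes[0] or primary.status != "active":
        raise RuntimeError("not primary or reconfiguring; retry later")
    primary.new_tag = (primary.new_tag[0] + 1, primary.node_id)   # step 1
    tag = primary.new_tag
    others = [r for r in replicas if r.node_id != primary.node_id]
    # Step 2; the leading 1 counts the primary's own vote (an assumption).
    acks = 1 + sum(r.on_put(primary.node_id, tag, new_value) for r in others)
    if acks <= len(primary.nodes) / 2:
        raise RuntimeError("no majority; retry later")
    if tag > primary.tag:          # step 4: skip if a later write completed
        primary.tag, primary.value = tag, new_value
    return tag

nodes = [10, 20, 30]
replicas = [Replica(n, nodes) for n in nodes]
t1 = write(replicas[0], replicas, "a")              # (1, 10)
t2 = write(replicas[0], replicas, "b")              # (2, 10)
late_ack = replicas[1].on_put(10, (1, 10), "a")     # delayed copy of write 1
```

The late put is acknowledged but leaves the replica's value at "b", which is the out-of-order case that the tag comparison in step 3 is meant to handle.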


5.5 Reconfiguration Protocol 


Etna must change the configuration of nodes responsible
for an object when a replica leaves (to maintain
the replication factor k) and when a new node joins
that would be the object’s successor (so that the primary
is the Chord successor and is thus easy to find).
Etna maintains only one configuration at a time, rather
than multiple configurations as in RAMBO [13]; this allows
Etna to be simpler and have higher performance
in all but the highest-churn environments. Etna uses the
Paxos distributed consensus protocol to decide on
the next configuration.


If an Etna node i notices that the set of Chord successors
for an object bID does not match the set of replicas in the
object’s config, it tries to initiate a reconfiguration:


Case I If i notices that it is the immediate successor of
bID, it collects some information that will serve
as a configuration proposal for a Paxos execution.
The proposal has the form
proposal_value = (new-config, object-copy).
Node i sets new-config to be the k immediate successors
of bID. Node i sends a recon-get RPC to each node
in config asking for its current object value and
tag, waits for a majority of replicas to respond,
and uses the most up-to-date response as the
object-copy in the proposal.


When a replica receives a recon-get RPC, it
sets status to recon_inprog and stops processing
all reads and writes.


When i has assembled a majority of recon-get
responses, it uses Paxos to propose its proposal_value
to the nodes in the old configuration.
Paxos calls the decide function at each node that
proposed a new configuration, with the consensus
configuration information; the proposer(s) send
the new configuration and most up-to-date object
to the replicas in the new configuration.


Case II If i is not the immediate successor of bID, i sends
the tag, value, and config information to the immediate
successor of bID, asking the successor to
initiate a reconfiguration.


Reconfiguration protocol for the primary:

Procedure recon()_i
  if config.nodes ≠ k_successors() then                      2
    if i = succ() then
      new-config ← k_successors()                            4
      status ← recon_inprog
      proposed ← false                                       6
      responses ← ∅
      ∀j ∈ config.nodes                                      8
        send(recon-get, config.seqnum)_{i,j}

Procedure recv(recon-ack, new-tag, new-value, seqnum)_{j,i}
  if seqnum = config.seqnum then                            12
    if new-tag > tag then
      tag ← new-tag                                         14
      value ← new-value
    responses ← responses ∪ {j}                             16
    if |responses| > ⌊k/2⌋ and proposed ≠ true then
      proposed ← true                                       18
      object-copy ← (tag, value)
      Paxos.propose(new-config, object-copy)_{config}       20

Procedure decide(new-config, object-copy)_c                 22
  ∀j ∈ new-config.nodes
    send(update, new-config, object-copy)_{i,j}             24


Reconfiguration protocol for the replicas:

Procedure recv(recon-get, seqnum)_{j,i}
  if (seqnum = config.seqnum) then                          25
    status ← recon_inprog
    send(recon-ack, tag, value, config.seqnum)_{i,j}        27

Procedure recv(update, new-config, object-copy)_{j,i}       29
  if i ∈ new-config.nodes then
    if new-config.seqnum > config.seqnum then               31
      status ← active
      config ← new-config                                   33
      if object-copy.tag > tag then
        tag ← object-copy.tag                               35
        value ← object-copy.value


Figure 6: Pseudo-code for reconfiguration.


Figure 6 shows the pseudo-code for the reconfiguration
protocol. During reconfiguration, nodes in the current
configuration become inactive (they stop serving write and
read requests). If reconfiguration fails, there will be no
active configuration. Section 7 discusses liveness.
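The proposal-building step of Case I can be sketched as follows. This is an illustration only: the dictionary representation of replicas, the `reachable` set that models which old replicas answer the recon-get RPC, and the function names are assumptions of this example, and the Paxos round that would follow is omitted.

```python
def recon_get(replica):
    """Replica side: freeze reads and writes, report the stored tag and value."""
    replica["status"] = "recon_inprog"
    return replica["tag"], replica["value"]

def build_proposal(old_replicas, reachable, new_nodes, old_seqnum):
    """Initiator side: return (new_config, object_copy), or None without a majority."""
    responses = [recon_get(r) for r in old_replicas if r["id"] in reachable]
    if len(responses) <= len(old_replicas) / 2:
        return None
    object_copy = max(responses, key=lambda tv: tv[0])   # most up-to-date tag
    new_config = {"seqnum": old_seqnum + 1, "nodes": list(new_nodes)}
    return new_config, object_copy

old = [
    {"id": 10, "status": "active", "tag": (2, 10), "value": "b"},
    {"id": 20, "status": "active", "tag": (3, 10), "value": "c"},
    {"id": 30, "status": "active", "tag": (1, 10), "value": "a"},
]
# Node 10 has left; the new successor set is hypothetical.
proposal = build_proposal(old, {20, 30}, [20, 30, 40], old_seqnum=1)
```

Any majority of the old configuration contains at least one replica that acknowledged the latest completed write, so taking the maximum tag among the responses recovers that write.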


6 Atomicity 


In this section, we show that Etna correctly implements
an atomic read/write object. Throughout this section,
we consider only a single object, b. Since atomic memory
is composable, this is sufficient to show that Etna
guarantees atomic consistency. We omit references to b
for the rest of this section.


We use the partial-ordering technique, described in 
Lynch [14]. We rely on the following lemma: 


Lemma 1 (Lemmas 13.16 and 13.10 in [14]). Let α be
any well-formed, finite execution of algorithm X (implementing
a read/write atomic object) in which every
operation completes. Let Π be the set of all operations
in α.

Suppose that ≺ is an irreflexive partial ordering of all
the operations in Π, satisfying the following properties:


1. For any operation A ∈ Π, there are only finitely
many operations B ∈ Π such that B ≺ A.

2. If A finishes before B starts, then it cannot be the
case that B ≺ A.

3. If A is a write operation in Π and B is any operation
in Π, then either A ≺ B or B ≺ A.

4. The value returned by each read operation is the
value written by the last preceding write operation
according to ≺ (or v_0, if there is no such write).


Then every well-formed execution of algorithm X satisfies
the atomicity property.


We consider an arbitrary well-formed execution, α, of
the Etna algorithm in which every read and write operation
completes. (Lemma 13.10 in Lynch [14] indicates
that it is sufficient to consider only such executions.)
We first define a partial order on the read and write operations
in α, and then show that this partial order has
the properties required by Lemma 1. Finally, we conclude
that Etna guarantees atomic consistency.


Partial Order. We first order the read and write operations
in α based on their tags. For a read or write operation
A ∈ α, initiated at node i, we define tag(A) =
tag_i immediately before the operation returns to the
reader or writer; that is, tag(A) is the value of the object’s
tag when i sends the result back to the client (Figure 5,
Line 18 and Figure 4, Line 13).


For a reconfiguration operation, π, we define
tag(π) = object-copy.tag, immediately before the call
to Paxos.propose (Figure 6, Line 20). We then define
the partial order ≺: (i) For any two operations A and B
in α: if tag(A) < tag(B), then A ≺ B. (ii) For any
write operation A in α, and any read operation B in α:
if tag(A) = tag(B) then A ≺ B. As we show in Theorem 2,
it is straightforward to see that this partial
order satisfies Properties 1, 3, and 4 of Lemma 1. The
primary goal of the rest of this section is to show that
this ordering satisfies Property 2.
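The two rules that define the partial order translate directly into a comparison on tagged operations. The sketch below is illustrative only; the `Op` record and the function name `precedes` are assumptions of this example, and tags are modeled as (version, primaryID) tuples.

```python
from collections import namedtuple

Op = namedtuple("Op", "kind tag")   # kind is "read" or "write"

def precedes(a, b):
    """True when operation a is ordered before operation b."""
    if a.tag < b.tag:               # rule (i): smaller tag comes first
        return True
    # Rule (ii): a write comes before any read that carries the same tag.
    return a.tag == b.tag and a.kind == "write" and b.kind == "read"

w1 = Op("write", (1, 10))
r1 = Op("read", (1, 10))
r2 = Op("read", (1, 10))
w2 = Op("write", (2, 10))
```

Two reads with the same tag, such as `r1` and `r2`, are left unordered, which is why the relation is a partial order; every write, however, is ordered against every other operation.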


Atomicity Proof. Our first goal is to show that when
a new configuration is installed, no information is lost.
Let configuration c_0 be the initial configuration, and
configuration c_{ℓ+1} be the unique configuration decided
on by Paxos_{c_ℓ}. (Theorem 1 ensures that this is, in fact,
unique.) If Paxos_{c_ℓ} does not terminate, then c_{ℓ+1} is
undefined.


Recall that when Paxos is initiated, the process performing
the reconfiguration includes object-copy, the latest
copy of the object, in the proposal. That is, Paxos_{c_ℓ} is
initiated with at least one call to:
Paxos.propose(new-config, object-copy)_{i,c_ℓ}.

Define tag(c_0) to be (0,0) and tag(c_{ℓ+1}) to be
object-copy.tag of the successful proposal to Paxos_{c_ℓ}.
We sometimes refer to tag(c_{ℓ+1}) as the initial tag of
configuration c_{ℓ+1}. We want to show that the initial tags
of the configurations are nondecreasing:


Lemma 2. Let c_ℓ and c_{ℓ+1} be configurations installed
in α. Then tag(c_ℓ) ≤ tag(c_{ℓ+1}).


Proof. No replica in configuration c_ℓ can send any response
until it has received an update message (Figure 6,
Lines 29-36), which causes the replica to set
its status to active. Therefore every response to the
recon-get message (Figure 6, Lines 24-27) during the
recon that proposes configuration c_{ℓ+1} must include
a tag no smaller than tag(c_ℓ). Therefore tag(c_ℓ) ≤
object-copy.tag in the proposal for c_{ℓ+1}, from which
the result follows. □


It then follows immediately by induction:


Corollary 1. If c_ℓ and c_k are two configurations in α,
and ℓ ≤ k, then tag(c_ℓ) ≤ tag(c_k).


We next consider a read or write operation that occurs
in configuration c_ℓ (for some ℓ ≥ 0). We want to show
that if A is an operation that completes in c_ℓ, then the
value A returns has a tag no smaller than tag(c_ℓ).


For read or write operation A, let conf(A) =
config.seqnum when i sends the result back to the client
(Figure 5, Line 18 and Figure 4, Line 13).


Lemma 3. Let B be a read or write operation in α, and
assume that conf(B) = ℓ. Then tag(c_ℓ) ≤ tag(B), and
if B is a write operation then the inequality is strict.


Proof. The reconfiguration to install configuration c_ℓ
concludes when a primary, i, wins a Paxos decision
(Figure 6) and sends messages to the
new replicas. Notice that the decision includes the
object-copy determined when the reconfiguration began.
Therefore, by the time the configuration is installed,
tag(c_ℓ) ≤ tag_i. As a result, every operation initiated
at node i has a tag greater than or equal to tag(c_ℓ),
and with write operations the inequality is strict, since
write increments the tag. □


Next, we relate the tag of a read or write operation to 
the tag of the next configuration. We want to show that 
if a read or write operation completes, the information 
is transferred to the next configuration. 


Lemma 4. Let A be a read or write operation in α,
and assume that conf(A) = ℓ. If configuration c_{ℓ+1} is
installed in α, then tag(A) ≤ tag(c_{ℓ+1}).


Proof. First, notice that the value of the primary always
reflects a write operation that has updated a majority of
the replicas. Therefore if A is a read operation, there
is a write operation, A’, that wrote the tag and value
returned by A to a majority of the replicas. If A is a
write operation, define A’ = A.


Since operation A’ completes in configuration c_ℓ, there
exists a set of more than k/2 nodes in configuration c_ℓ
that send a response for A’ to the primary. Call this
set of nodes W (for “writers”).


Since configuration c_{ℓ+1} is installed in α, there exists a
set of more than k/2 nodes in configuration c_ℓ that send
a response to the recon-get message during the reconfiguration.
Call this set of nodes R (for “readers”).


Notice that since there are k nodes in configuration c_ℓ,
and both R and W contain more than k/2 nodes, there is
at least one node, j, in both R and W. Node j sends
a response both for operation A’ and for the successful
reconfiguration resulting in c_{ℓ+1}.


We claim that node j sends the response for operation
A’ before the response for recon-get. As soon as node
j sends a response for a recon-get, it sets its status_j to
recon_inprog, at which point it ceases responding
to requests. Since we know that j sends a response to
operation A’, it must send this response prior to the first
time it receives a recon-get request.


We conclude, then, that the primary sends its put request
for A’ to replica j prior to j sending its response to
the recon-get request. Therefore, tag(A) = tag(A’) ≤
tag(c_{ℓ+1}). □
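The intersection step in the proof of Lemma 4 relies on the fact that any two majorities drawn from the same k nodes share at least one node. The following brute-force check illustrates this for small k; the function names are assumptions of this example, and the enumeration is meant as a sanity check, not as part of the protocol.

```python
from itertools import combinations

def majorities(nodes):
    """Yield every subset of nodes that contains more than half of them."""
    need = len(nodes) // 2 + 1
    for size in range(need, len(nodes) + 1):
        yield from (set(c) for c in combinations(nodes, size))

def always_intersect(k):
    """True if every pair of majorities of k nodes has a common member."""
    nodes = range(k)
    return all(w & r for w in majorities(nodes) for r in majorities(nodes))

checked = all(always_intersect(k) for k in range(1, 8))
# Sets of exactly k/2 nodes are not enough: for k = 4 these two are disjoint.
half_sets_overlap = bool({0, 1} & {2, 3})
```

The second check shows why the quorum must be strictly larger than k/2: with an even k, two halves can miss each other entirely.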


In the final preliminary lemma, we show that the config- 
urations used by operations are non-decreasing. That is, 
if operation A occurs in one configuration, then a later 
operation B cannot occur in an earlier configuration. 


Lemma 5 (sketch). Let A and B be two read or write
operations in α. Assume that operation A completes
before operation B begins. Then conf(A) ≤ conf(B).


Proof. If A completes in configuration c_ℓ, then the
reconfiguration that installs c_ℓ completes prior to A. During that
reconfiguration, a majority of replicas in configuration
c_{ℓ−1} were sent a recon-get message notifying them to
cease processing read and write requests. By induction,
a majority of replicas from all earlier configurations received
such messages. Therefore operation B cannot
complete after A using an earlier configuration. □


Finally, we relate read and write operations. 


Lemma 6. Let A and B be two read or write operations
in α where A completes before B begins. Then
tag(A) ≤ tag(B), and if B is a write operation then the
inequality is strict.


Proof. We break the proof down into two cases: (i) A
and B complete in the same configuration, and (ii) A
completes in an earlier configuration than B. Lemma 5
shows that A cannot complete in a later configuration
than B.


First, assume that ℓ = conf(A) = conf(B). Let node
i be the primary of configuration c_ℓ. Both operations
originate at node i. Therefore, when operation B begins,
the tag of i is at least as large as tag(A). If B
is a write operation, then i increments the tag, and the
inequality is strict.

Next, consider the case where conf(A) < conf(B). Notice
that tag(A) ≤ tag(c_{conf(A)+1}), by Lemma 4. Next,
notice that tag(c_{conf(A)+1}) ≤ tag(c_{conf(B)}), by Corollary 1,
since conf(A) + 1 ≤ conf(B).
Third, notice that tag(c_{conf(B)}) ≤ tag(B), and
if B is a write operation, the inequality is strict, by
Lemma 3. Combining the inequalities implies the desired
result. □


Finally, we prove the main theorem: 


Theorem 2. The Etna algorithm correctly implements 
an atomic read/write object. 


Proof. We show that the protocol satisfies the four
conditions of Lemma 1. For an arbitrary execution α in
which every read and write operation completes, we
demonstrate that the partial ordering, <, satisfies
Properties 1-4 of Lemma 1.


1. Immediate. 

2. It follows from Lemma 6 that if A completes before
B begins, then tag(A) ≤ tag(B). Therefore
B ≮ A.

3. If A and B are write operations, then it follows
immediately that tag(A) ≠ tag(B), since the tags
are unique (as they consist of a sequence number
and a node identifier to break ties).
If A is a write operation and B is a read operation
and tag(A) = tag(B) then A < B. Otherwise, if
tag(A) ≠ tag(B), then either A < B or B < A,
depending on whether A or B has a larger tag.

4. This follows by the definition of the partial order. 
If B is a read operation, then tag(B) is the tag of 
the write operation, A, whose value B returns, or 
the initial tag. Therefore, either A is the last
preceding write operation (since tag(A) = tag(B)), or
B returns v_0.


□
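
The tag ordering used throughout the proof can be sketched in Python as follows. This is an illustration rather than Etna's actual code: the class and function names (`Tag`, `Op`, `next_write_tag`, `precedes`) are hypothetical, and it assumes that "incrementing the tag" means increasing the sequence number by one and stamping the tag with the primary's identifier.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Tag:
    seq: int        # sequence number, compared first
    node_id: str    # node identifier, used only to break ties

def next_write_tag(current: Tag, primary_id: str) -> Tag:
    """Tag assumed to be assigned to a new write: strictly larger than current."""
    return Tag(current.seq + 1, primary_id)

@dataclass(frozen=True)
class Op:
    kind: str   # "read" or "write"
    tag: Tag

def precedes(a: Op, b: Op) -> bool:
    """Partial order on operations: smaller tag first; on equal tags,
    a write precedes the reads that return its value."""
    if a.tag != b.tag:
        return a.tag < b.tag
    return a.kind == "write" and b.kind == "read"

initial = Tag(0, "")                                 # initial tag (value v_0)
w1 = Op("write", next_write_tag(initial, "n1"))
r1 = Op("read", w1.tag)                              # read returning w1's value
w2 = Op("write", next_write_tag(w1.tag, "n2"))

assert precedes(w1, r1) and not precedes(r1, w1)
assert precedes(r1, w2) and precedes(w1, w2)
```

Two reads that return the same value carry the same tag and are left unordered by `precedes`, which is why the relation is a partial order rather than a total one.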


7 Theoretical Performance 


As in all quorum-based algorithms, the performance of
the algorithm depends on enough replicas remaining 
alive. We assume that if a node crashes, the remain- 
ing live nodes in the relevant configurations notice the 
crash and reconfigure quickly enough to maintain a live 
majority. If a majority of the nodes in a configuration 
fail, then operations can no longer complete. A recov- 
ery protocol could attempt to collect the values from the 
remaining replicas, at the expense of atomicity; we do 
not address recovery from failed configurations in this 
paper. 

During intervals in which the primary does not fail, the 
algorithm is efficient. A write operation requires a sin- 
gle round of communication to propagate the new value 
to a quorum. A read operation also requires only a sin- 
gle round of communication, involving only small con- 
trol messages, since the primary supplies the data. For 
the purpose of this section, we assume that each mes- 
sage is delivered in time d: 


Lemma 7. If a read or write operation begins at time 
t, and the primary does not fail by time t + 2d, then the 
operation completes by time t + 2d. 


A reconfiguration is somewhat more expensive, requiring
three and a half rounds of communication. As soon
as a new primary is established, it queries the old
replicas for the latest value of the block. It then begins
Paxos, which requires two rounds of communication to
arrive at a decision. Finally, it updates the new
configuration. In our implementation, we piggyback the
recon-get RPC on the first RPC of Paxos, reducing
the communication to two and a half rounds. The
following lemma reflects this optimization.


Lemma 8. If node i is designated the primary at time
t, and i does not fail by time t + 5d, then the new con- 
figuration is installed by time t + 5d. 


Recall that when a reconfiguration takes place, ongoing 
read and write operations may fail to complete. We as- 
sume that, in this case, the client retries the operation 
at the new primary. Combining the two previous lem- 
mas, we see that even if a primary fails, a read or write 
operation completes within 7d after a new primary is 
designated. 
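
The arithmetic behind these bounds can be written out as a short Python sketch. The function names are hypothetical, and the sketch assumes that one round of communication costs 2d (one message in each direction), which is consistent with Lemma 7 (one round, 2d) and Lemma 8 (two and a half rounds, 5d). The figure for the unoptimized reconfiguration is derived from that same assumption rather than stated in the text.

```python
def op_latency_bound(d: float) -> float:
    """Lemma 7: a read or write takes one round from the primary, i.e. 2d."""
    return 2 * d

def reconfig_latency_bound(d: float, piggyback: bool = True) -> float:
    """Rounds of communication times 2d per round (assumed cost of a round).
    3.5 rounds without the optimization; 2.5 rounds when recon-get is
    piggybacked on the first Paxos RPC (Lemma 8)."""
    rounds = 2.5 if piggyback else 3.5
    return rounds * 2 * d

def retry_latency_bound(d: float) -> float:
    """After a primary failure: install the new configuration (Lemma 8),
    then retry the operation at the new primary (Lemma 7)."""
    return reconfig_latency_bound(d) + op_latency_bound(d)

d = 50.0  # hypothetical one-way message delay in milliseconds
assert op_latency_bound(d) == 100.0        # 2d
assert reconfig_latency_bound(d) == 250.0  # 5d
assert retry_latency_bound(d) == 350.0     # 7d
```

With a one-way delay of 50 ms, for example, the bound after a new primary is designated is 7d = 350 ms, of which 5d is spent installing the new configuration.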


8 Experimental Evaluation 


We are in the process of conducting more extensive ex- 
periments. 


9 Conclusion 


This paper describes Etna, an algorithm for atomic mu- 
table blocks in a distributed hash table. Etna correctly 
handles a dynamically changing set of replica hosts, us- 
ing protocols optimized for situations in which reads 
are more common than writes or replica set changes. 
Etna uses Paxos to agree on a sequence of replica con- 
figurations, and only allows operations when a majority 
of replicas from the active configuration are available. 
Etna’s write latency is comparable to that of non-atomic 
replicated DHTs, and its read latency is approximately 
twice that of a non-atomic DHT.